SemanticScuttle - klotz.me » Tags: production engineering

Tags: production engineering*

Production Engineering focuses on the design, implementation, and management of systems and processes to ensure the efficient and reliable delivery of software and services in a production environment. It involves various aspects such as deploying, monitoring, and maintaining applications, managing infrastructure, and handling data pipelines. Production Engineering KPIs include Availability and Cost.


How Nubank Built its in-house log platform

This article details how Nubank built an in-house logging platform to address the cost, scalability, and control problems of its logging infrastructure. Initially reliant on a vendor solution, the company saw costs rise unpredictably and ran into limits on observability and data retention.

To solve this, Nubank divided the project into two major steps: **The Observability Stream** (ingestion and processing) and the **Query & Log Platform** (storage and querying).

* **Observability Stream:** Fluent Bit for data collection, a Data Buffer Service for micro-batching, and an in-house Filter & Process Service.
* **Query & Log Platform:** Trino as the query engine, AWS S3 for storage, and Parquet for data format.
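
The summary does not say how the Data Buffer Service batches records. As a minimal sketch of the general micro-batching idea, assuming flushes are triggered by either a record count or a maximum batch age, the Python below uses hypothetical names (`MicroBatcher`, `max_records`, `max_age_seconds`) that are not taken from Nubank's implementation:

```python
import time
from typing import Callable, List, Optional


class MicroBatcher:
    """Buffers records and flushes them as one batch.

    Hypothetical illustration only: a flush happens when the buffer holds
    `max_records` items or when the oldest buffered record is at least
    `max_age_seconds` old. The age check runs only when a record is added.
    """

    def __init__(
        self,
        flush: Callable[[List[str]], None],
        max_records: int = 500,
        max_age_seconds: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_records = max_records
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer: List[str] = []
        self._first_at: Optional[float] = None

    def add(self, record: str) -> None:
        # Remember when the current batch started.
        if not self._buffer:
            self._first_at = self._clock()
        self._buffer.append(record)
        too_big = len(self._buffer) >= self._max_records
        too_old = self._clock() - self._first_at >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        # Hand the batch to the sink and start a fresh buffer.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._first_at = None
            self._flush(batch)
```

A production service would also flush on a timer so that a quiet stream does not hold records indefinitely. This sketch leaves that out to stay short.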

The new platform currently ingests 1 trillion logs daily, stores 45 PB of searchable data with 45-day retention, and handles almost 15,000 queries daily. Nubank reports that the platform costs 50% less than comparable market solutions while giving the company greater control, better scalability, and the ability to customize features. The project reflects Nubank's stated value of challenging the status quo and its approach of combining open-source components with in-house development.

2025-10-28 Tags: logging, nubank, observability, trino, aws s3, parquet, data, data engineering, production engineering, observability bus by klotz

k8s-1m Overview

An effort to create a fully functional Kubernetes cluster with 1 million active nodes. The article details the challenges and solutions for scaling Kubernetes to this size, covering networking, state management (etcd), and the scheduler.

2025-10-20 Tags: kubernetes, k8s, scalability, etcd, scheduler, networking, ben chess, github, tutorial, scale, production engineering by klotz

Why Do Transformers Fail to Forecast Time Series In-Context?

This paper provides a theoretical analysis of Transformers' limitations for time series forecasting through the lens of In-Context Learning (ICL) theory, demonstrating that even powerful Transformers often fail to outperform simpler models like linear models. The study focuses on Linear Self-Attention (LSA) models and shows that they cannot achieve lower expected MSE than classical linear models for in-context forecasting, and that predictions collapse to the mean exponentially under Chain-of-Thought inference.

2025-10-17 Tags: time series, forecasting, transformers, in-context learning, linear, self-attention, machine learning, statistical models, llm, production engineering by klotz

Prompt Engineering for Time-Series Analysis with Large Language Models

This article explores how prompt engineering can be used to improve time-series analysis with Large Language Models (LLMs), covering core strategies, preprocessing, anomaly detection, and feature engineering. It provides practical prompts and examples for various tasks.

2025-10-16 Tags: llm, prompt engineering, time series, forecasting, anomaly detection, feature engineering, data science, machine learning, production engineering, observability by klotz

Dozzle is the perfect self-hosted container monitoring and logging tool

Dozzle is a lightweight, self-hosted tool for viewing container logs in real time. It offers an intuitive UI and intelligent search, and suits use cases such as home labs and local development.

2025-09-26 Tags: dozzle, containers, monitoring, logs, self-hosted, docker, production engineering by klotz

Launch HN: Datafruit (YC S25) – AI for DevOps

2025-09-03 Tags: llm, devops, production engineering, hn, datafruit by klotz

TraceRoot.AI

TraceRoot.AI is an AI-native observability platform that helps developers fix production bugs faster by analyzing structured logs and traces. It offers SDK integration, AI agents for root cause analysis, and a platform for comprehensive visualizations.

2025-08-30 Tags: observability, traceroot.ai, debugging, logs, traces, root cause analysis, sdk, automation, monitoring, sre, devops, production engineering, hallux.ai by klotz

Find the Root Cause in Your Code's Trace

TraceRoot accelerates the debugging process with AI-powered insights. It integrates seamlessly into your development workflow, providing real-time trace and log analysis, code context understanding, and intelligent assistance. It offers both a cloud and self-hosted version, with SDKs available for Python and JavaScript/TypeScript.

2025-08-30 Tags: agent, debugging, monitoring, trace, observability, multi-agent-systems, llm, production engineering, devops, sre, hallux.ai, root cause analysis, github by klotz

The Missing Layer in AI Infrastructure: Aggregating Agentic Traffic

The article discusses the emergence of 'agentic traffic' – outbound API calls made by autonomous AI agents – and the need for a new infrastructure layer, an 'AI Gateway', to govern and secure this traffic. It outlines the components of an AI Gateway and the importance of security, compliance, and observability in managing agentic AI.

2025-08-25 Tags: gateway, agents llm, api gateway, infrastructure, security, observability, governance, production engineering by klotz

Logs, Metrics & Traces: A Before and After Story

The article describes a company's transition from fragmented observability tools to a unified system built on OpenTelemetry and OneUptime, which cut MTTR from 41 to 9 minutes. By correlating logs, metrics, and traces through structured logging and intelligent sampling, the team eliminated much of the noise and confusion that had slowed root cause analysis. The shift also reduced the number of dashboards engineers needed to check per incident and significantly lowered the percentage of incidents with unknown causes.

Key practices included instrumenting once with OpenTelemetry, enforcing cardinality limits, and archiving raw data for future analysis. The move away from 100% trace capture and over-instrumentation helped manage data volume while maintaining visibility into anomalies. This transformation emphasized that effective observability isn't about collecting more data, but about designing correlated signals that support intentional diagnosis and reduce cognitive load.
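
The summary does not describe the sampling rules the team used. One common way to move away from 100% trace capture while keeping anomalies visible is sketched below, assuming every errored or slow trace is kept and only a fixed fraction of the rest. All names and thresholds (`TraceSummary`, `keep_trace`, `slow_ms`, `baseline_rate`) are hypothetical and are not part of the OpenTelemetry or OneUptime APIs:

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical summary of a finished trace."""
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(
    trace: TraceSummary,
    slow_ms: float = 1000.0,
    baseline_rate: float = 0.05,
) -> bool:
    """Decide whether to retain a trace.

    Errored or slow traces are always kept. Other traces are kept at
    roughly `baseline_rate`, chosen by hashing the trace ID so that the
    same trace always gets the same decision.
    """
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < baseline_rate
```

Hashing the trace ID rather than drawing a random number keeps the decision consistent across services that see the same trace, so a sampled trace is not left with missing spans.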

2025-08-21 Tags: observability, opentelemetry, logs, metrics, traces, production engineering by klotz
